
    Navigating Diverse Datasets in the Face of Uncertainty

    When exploring large volumes of data, one of the challenging aspects is their diversity of origin. Multiple files that have not yet been ingested into a database system may contain information of interest to a researcher, who must curate, understand and sieve their content before being able to extract knowledge. Performance is one of the greatest difficulties in exploring these datasets. On the one hand, examining non-indexed, unprocessed files can be inefficient. On the other hand, any processing done before the data is understood introduces latency and potentially unnecessary work if the chosen schema matches the data poorly. We have surveyed the state of the art and, fortunately, there exist multiple proposed solutions for handling data in situ with good performance. Another major difficulty is matching files from multiple origins, since their schemas and layouts may not be compatible or properly documented. Most surveyed solutions overlook this problem, especially for numeric, uncertain data, as is typical in fields like astronomy. The main objective of our research is to assist data scientists during the exploration of unprocessed, numerical, raw data distributed across multiple files, based solely on its intrinsic distribution. In this thesis, we first introduce the concept of Equally-Distributed Dependencies (EDD), which provides the foundation for matching this kind of dataset. We propose PresQ, a novel algorithm that finds quasi-cliques on hypergraphs based on their expected statistical properties. The probabilistic approach of PresQ can be successfully exploited to mine EDDs between diverse datasets when the underlying populations can be assumed to be the same. Finally, we propose a two-sample statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of power, other classifier-based two-sample tests, being in some cases comparable to kernel-based methods, with the advantage of being interpretable. Both PresQ and the SOM-based statistical test can provide insights that drive serendipitous discoveries.
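The SOM-based two-sample test mentioned above can be sketched roughly as follows: train a small self-organizing map on the pooled samples, then compare how the two samples distribute over the map cells with a chi-squared test. This is only an illustrative reconstruction under stated assumptions (a classic online SOM update and a cell-occupancy contingency test), not the thesis implementation; all function names are ours.

```python
# Illustrative sketch of a SOM-based two-sample test (not the thesis code).
# Idea: train a small self-organizing map on the pooled samples, then compare
# how the two samples distribute over the map cells with a chi-squared test.
import numpy as np
from scipy.stats import chi2_contingency

def train_som(data, grid=(4, 4), epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Very small SOM trained with the classic online update rule."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.normal(size=(h * w, data.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):          # shuffle rows each epoch
            frac = step / n_steps
            lr = lr0 * (1 - frac)                # linearly decaying rate
            sigma = sigma0 * (1 - frac) + 1e-3   # shrinking neighbourhood
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            neigh = np.exp(-d2 / (2 * sigma ** 2))
            weights += lr * neigh[:, None] * (x - weights)
            step += 1
    return weights

def som_two_sample_test(a, b, grid=(4, 4)):
    """Return the chi-squared p-value for 'a and b come from the same population'."""
    weights = train_som(np.vstack([a, b]), grid)
    bmu = lambda s: np.argmin(((weights[None] - s[:, None]) ** 2).sum(-1), axis=1)
    cells = weights.shape[0]
    counts = np.array([np.bincount(bmu(a), minlength=cells),
                       np.bincount(bmu(b), minlength=cells)])
    counts = counts[:, counts.sum(axis=0) > 0]   # drop empty cells
    _, p, _, _ = chi2_contingency(counts)
    return p
```

The interpretability claimed in the abstract comes from the map itself: cells where the two samples disagree most point at the regions of feature space driving the rejection.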

    Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping Study

    When exploring big amounts of data without a clear target, providing an interactive experience becomes really difficult, since this tentative inspection usually defeats any early decision on data structures or indexing strategies. This is also true in the physics domain, specifically in high-energy physics, where the huge volume of data generated by the detectors is normally explored via C++ code using batch processing, which introduces considerable latency. An interactive tool, when integrated into the existing data management systems, can add great value to the usability of these platforms. Here, we intend to review the current state of the art of interactive data exploration, aiming at satisfying three requirements: access to raw data files, stored in a distributed environment, and with a reasonably low latency. This paper follows the guidelines for systematic mapping studies, which are well suited for gathering and classifying available studies. We summarize the results after classifying the 242 papers that passed our inclusion criteria. While there are many proposed solutions that tackle the problem in different manners, there is little evidence available about their implementation in practice. Almost all of the solutions found by this study cover a subset of our requirements, with only one partially satisfying all three. Solutions for data exploration abound. It is an active research area and, considering the continuous growth of data volume and variety, the problem is only going to become harder. There is a niche for research on a solution that covers all of our requirements, and the required building blocks are there.

    FTS3 / WebFTS – A Powerful File Transfer Service for Scientific Communities

    FTS3, the service responsible for globally distributing the majority of the LHC data across the WLCG infrastructure, is now available to everybody. Already integrated into LHC experiment frameworks, a new web interface now makes FTS3's transfer technology directly available to end users. In this article we describe this intuitive new interface, “WebFTS”, which allows users to easily schedule and manage large data transfers right from the browser, profiting from a service which has been proven at the scale of petabytes per month. We shed light on new development activities to extend FTS3's transfer capabilities outside Grid boundaries with support for non-Grid endpoints like Dropbox and S3. We also describe the latest changes integrated into the transfer engine itself, such as new data management operations like deletions and staging files from archive, all of which may be accessed through our standards-compliant REST API. For service managers, we explain features such as the service's horizontal scalability, advanced monitoring and its “zero configuration” approach to deployment, made possible by specialised transfer optimisation logic. For data managers, we present new tools for the management of FTS3 transfer parameters, such as limits on bandwidth and on the maximum number of active file transfers per endpoint and VO, user and endpoint banning, and powerful command line tools. We finish by describing our effort to extend WebFTS's captivating graphical interface with support for Federated Identity technologies, thus demonstrating the use of grid resources without the burden of certificate management. In this manner we show how FTS3 can cover the needs of a wide range of parties, from casual users to high-load services. The evolution of FTS3 addresses the technical and performance requirements and challenges of LHC Run 2; moreover, its simplicity, generic design, web portal and REST interface make it an ideal file transfer scheduler both inside and outside the HEP community.
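To make the REST API mentioned above concrete, the sketch below builds the kind of JSON job document that FTS3's REST interface accepts for submission (a POST to the service's `/jobs` endpoint). Field names follow the public fts3-rest documentation as we understand it and should be treated as assumptions; the URLs are placeholders, and a real submission additionally needs an authenticated HTTPS session (X.509 proxy or delegated credential), which is omitted here.

```python
# Sketch of a job-submission document for the FTS3 REST API (POST /jobs).
# Field names follow the public fts3-rest documentation; the transfer URLs
# below are placeholders. Authentication (X.509 proxy delegation) is omitted.
import json

def build_fts3_job(source_url, dest_url):
    job = {
        "files": [{
            "sources": [source_url],          # one or more source replicas
            "destinations": [dest_url],       # one or more destination URLs
        }],
        "params": {
            "verify_checksum": True,          # end-to-end checksum validation
            "retry": 3,                       # server-side retry attempts
        },
    }
    return json.dumps(job)
```

Per-endpoint and per-VO limits (bandwidth, maximum active transfers) are configured on the server side by the data managers, not in the job document itself.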

    Towards an HTTP Ecosystem for HEP Data Access

    In this contribution we present a vision for the use of the HTTP protocol for data access and data management in the context of HEP. The evolution of the DPM/LFC software stacks towards a modern framework that can be plugged into Apache servers triggered various initiatives that successfully demonstrated the use of HTTP-based protocols for data access, federation and transfer. This includes the evolution of the FTS3 system towards being able to manage third-party transfers using HTTP. Given the flexibility of the methods, the feature set may also include a subset of the SRM functionality that is relevant to disk systems. The application domain for such an ecosystem of services goes from large-scale, Grid-like computing to data access from laptops, profiting from tools that are shared with the Web community, like browsers, client libraries and others. Particular focus was put on emphasizing the flexibility of the frameworks, which can interface with a very broad range of components, data stores, catalogues and metadata stores, including the possibility of building high-performance dynamic federations of endpoints that create, on the fly, the impression of a single, seamless, very efficient system. The overall goal is to leverage standards and standard practices, and use them to provide the higher-level functionalities needed to fulfil the complex problem of data access in HEP. Other points of interest concern harmonizing the possibilities offered by the HTTP/WebDAV protocols with existing frameworks like ROOT and with the already existing storage federations based on the XROOTD framework. We also provide quantitative evaluations of the performance that is achievable using HTTP for remote transfer and remote I/O in the context of HEP data. The idea is to contribute the parts that can make possible an ecosystem of services and applications, where the HEP-related features are covered and the door is open to standard solutions and tools provided by third parties, in the context of Web and Cloud technologies.
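The HTTP third-party transfers mentioned above build on the standard WebDAV COPY method (RFC 4918), where the request carries a `Destination` header naming the target endpoint so the two storage systems can move the data directly. The sketch below only constructs such a request rather than sending it; the endpoints are placeholders, and the credential-delegation and pull-mode variants used in production HEP setups are extensions layered on top of plain WebDAV.

```python
# Minimal sketch of the HTTP third-party-copy pattern: a WebDAV COPY request
# issued to the source storage element, with the Destination header (RFC 4918)
# naming the target endpoint. The request is only constructed, not sent;
# production deployments add authentication and credential delegation.

def build_tpc_copy(source_path, destination_url):
    request_line = f"COPY {source_path} HTTP/1.1"
    headers = {
        "Destination": destination_url,
        # Pull-mode (Source header) and delegated-credential variants are
        # HEP-specific extensions on top of plain WebDAV COPY.
    }
    return request_line, headers
```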

    Making the most of cloud storage - a toolkit for exploitation by WLCG experiments

    Understanding how cloud storage can be effectively used, either standalone or in support of its associated compute, is now an important consideration for WLCG. We report on a suite of extensions to familiar tools targeted at enabling the integration of cloud object stores into traditional grid infrastructures and workflows. Notable updates include support for a number of object store flavours in FTS3, Davix and gfal2, including mitigations for the lack of vector reads; the extension of Dynafed to operate as a bridge between grid and cloud domains; protocol translation in FTS3; and the implementation of extensions to DPM (also implemented by the dCache project) to allow third-party transfers over HTTP. The result is a toolkit which facilitates data movement and access between grid and cloud infrastructures, broadening the range of workflows suitable for the cloud. We report on deployment scenarios and prototype experience, explaining how, for example, an Amazon S3 or Azure allocation can be exploited by grid workflows.
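One common mitigation for the lack of vector reads mentioned above is to coalesce many small (offset, length) requests into fewer HTTP range requests, trading a little extra data transferred for far fewer round trips to the object store. The sketch below illustrates that idea only; it is not the actual Davix/gfal2 logic, and the gap threshold is an arbitrary assumption.

```python
# Illustrative sketch of one mitigation for object stores that lack vector
# reads: coalesce many small (offset, length) requests into fewer HTTP range
# requests, reading a little extra data instead of paying one round trip per
# fragment. This is not the actual Davix/gfal2 implementation.

def coalesce_ranges(requests, max_gap=64 * 1024):
    """Merge byte ranges whose gap is at most max_gap bytes.

    requests: iterable of (offset, length) pairs.
    Returns a sorted list of merged (offset, length) pairs suitable for
    HTTP Range headers.
    """
    spans = sorted((off, off + length) for off, length in requests)
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)   # absorb into last span
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```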

    Dynamic federations: storage aggregation using open tools and protocols

    A number of storage elements now offer standard protocol interfaces, like NFS 4.1/pNFS and WebDAV, for access to their data repositories, in line with the standardization effort of the European Middleware Initiative (EMI). The LCG File Catalogue (LFC) can also offer such features. Here we report on work that seeks to exploit the federation potential of these protocols and build a system that offers a unique view of the storage and metadata ensemble, as well as the possibility of integrating other compatible resources, such as those from cloud providers. The challenge, here undertaken by the providers of dCache and DPM, and pragmatically open to other Grid and Cloud storage solutions, is to build such a system while being able to accommodate name translations from existing catalogues (e.g. LFCs), experiment-based metadata catalogues, or stateless algorithmic name translations, also known as "trivial file catalogues". Such storage federations of standard-protocol-based storage elements give a unique view of their content, thus promoting simplicity in accessing the data they contain and offering new possibilities for resilience and data placement strategies. The goal is to consider HTTP- and NFS 4.1-based storage elements and metadata catalogues and make them able to cooperate through an architecture that properly feeds the redirection mechanisms they are based upon, thus giving the functionality of a "loosely coupled" storage federation. One of the key requirements is to use standard clients (provided by operating systems or open-source distributions, e.g. Web browsers) to access an already aggregated system; this approach is quite different from aggregating the repositories at the client side through some wrapper API, like for instance GFAL, or by developing new custom clients. Other technical challenges that will determine the success of this initiative include performance, latency and scalability, and the ability to create worldwide storage federations that can redirect clients to repositories they can efficiently access, for instance by choosing the closest endpoints or by applying other criteria. We believe that the features of a loosely coupled federation of open-protocol-based storage elements will open many possibilities for evolving the current computing models without disrupting them and, at the same time, will be able to operate with the existing infrastructures, follow their evolution path and add storage centres that can be acquired as a third-party service.
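The "unique view" at the heart of the federation can be illustrated with a toy sketch: listings from several endpoints are merged on the fly into one namespace, and each path keeps the list of endpoints (replicas) that can serve it, so the redirector can pick one per client. The endpoint names below are placeholders; real federators such as Dynafed additionally handle caching, endpoint timeouts and geography-aware replica ordering.

```python
# Toy sketch of the "unique view" behind a loosely coupled storage
# federation: per-endpoint listings are merged into a single namespace
# mapping each path to the endpoints (replicas) that can serve it.
# Endpoint names are placeholders; real federators (e.g. Dynafed) also
# handle caching, timeouts and geography-aware replica selection.

def federate_listings(listings):
    """listings: dict endpoint -> iterable of paths. Returns path -> replicas."""
    namespace = {}
    for endpoint, paths in listings.items():
        for path in paths:
            namespace.setdefault(path, []).append(endpoint)
    return namespace
```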

    Rare predicted loss-of-function variants of type I IFN immunity genes are associated with life-threatening COVID-19

    Background: We previously reported that impaired type I IFN activity, due to inborn errors of TLR3- and TLR7-dependent type I interferon (IFN) immunity or to autoantibodies against type I IFN, accounts for 15-20% of cases of life-threatening COVID-19 in unvaccinated patients. Therefore, the determinants of life-threatening COVID-19 remain to be identified in ~80% of cases.
    Methods: We report here a genome-wide rare-variant burden association analysis in 3269 unvaccinated patients with life-threatening COVID-19 and 1373 unvaccinated SARS-CoV-2-infected individuals without pneumonia. Among the 928 patients tested for autoantibodies against type I IFN, a quarter (234) were positive and were excluded.
    Results: No gene reached genome-wide significance. Under a recessive model, the most significant gene with at-risk variants was TLR7, with an OR of 27.68 (95% CI 1.5-528.7, P=1.1x10^-4) for biochemically loss-of-function (bLOF) variants. We replicated the enrichment in rare predicted LOF (pLOF) variants at 13 influenza-susceptibility loci involved in TLR3-dependent type I IFN immunity (OR=3.70 [95% CI 1.3-8.2], P=2.1x10^-4). This enrichment was further strengthened by (1) adding the recently reported TYK2 and TLR7 COVID-19 loci, particularly under a recessive model (OR=19.65 [95% CI 2.1-2635.4], P=3.4x10^-3), and (2) considering as pLOF branchpoint variants with potentially strong impacts on splicing among the 15 loci (OR=4.40 [95% CI 2.3-8.4], P=7.7x10^-8). Finally, the patients with pLOF/bLOF variants at these 15 loci were significantly younger (mean age [SD] = 43.3 [20.3] years) than the other patients (56.0 [17.3] years; P=1.68x10^-5).
    Conclusions: Rare variants of TLR3- and TLR7-dependent type I IFN immunity genes can underlie life-threatening COVID-19, particularly with recessive inheritance, in patients under 60 years old.
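Figures like "OR = 3.70 (95% CI 1.3-8.2)" summarize 2x2 carrier/non-carrier tables. The sketch below shows the standard odds-ratio arithmetic with a Woolf log-interval on hypothetical counts; the study itself fits burden-test models with covariates, so this plain 2x2 computation is only an illustration of what the reported numbers mean.

```python
# Worked sketch of the odds-ratio arithmetic behind figures like
# "OR = 3.70 (95% CI 1.3-8.2)": a 2x2 carrier/non-carrier table and the
# standard Woolf log-interval. The counts in the test are hypothetical;
# the study itself fits covariate-adjusted burden-test models instead.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b: carriers / non-carriers among cases; c, d: same among controls."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(OR), Woolf
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```

The very wide intervals reported for TLR7 (1.5-528.7) reflect exactly this formula: with only a handful of carriers, the 1/a term dominates the standard error of log(OR).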